Deterministic algorithms for skewed matrix products 

Konstantin Kutzkov 

IT University Copenhagen 
Denmark 

kutzkov@gmail . com 



Abstract 

Recently, Pagh presented a randomized approximation algorithm for the multiplication of real-valued 
matrices building upon work for detecting the most frequent items in data streams. We continue this line 
of research and present new deterministic matrix multiplication algorithms. 

Motivated by applications in data mining, we first consider the case of real-valued, nonnegative 
n-by-n input matrices A and B, and show how to obtain a deterministic approximation of the weights of 
individual entries, as well as the entrywise p-norm, of the product AB. The algorithm is simple, space 
efficient and runs in one pass over the input matrices. For a user defined b £ (0, n 2 ) the algorithm runs in 
time 0(nb + n ■ Sort(n)) and space 0(n + b) and returns an approximation of the entries of AB within 
an additive factor of || AB||_ei/6, where ||C||bi = . |CV, | is the entrywise 1-norm of a matrix C and 
Sort(n) is the time required to sort n real numbers in linear space. Building upon a result by Berinde 
et al. we show that for skewed matrix products (a common situation in many real-life applications) 
the algorithm is more efficient and achieves better approximation guarantees than previously known 
randomized algorithms. 

When the input matrices are not restricted to nonnegative entries, we present a new deterministic 
group testing algorithm detecting nonzero entries in the matrix product with large absolute value. The 
algorithm is clearly outperformed by randomized matrix multiplication algorithms, but as a byproduct we 
obtain the first 0(n 2+E )-time deterministic algorithm for matrix products with 0(yfn) nonzero entries. 

1 Introduction 

The complexity of matrix multiplication is one of the fundamental problems in theoretical computer science. 
Since Strassen's sub-cubic algorithm for matrix multiplication over a ring from the late 1960's [32], the topic 
has received considerable attention, see [5] for a historical overview on the subject. It is conjectured that 
matrix multiplication admits an algorithm running in time 0(n 2+e ) for any e > 0. For more than 20 years 
the record holder was the algorithm by Coppersmith and Winograd |13j running in time 0(n 2 ). Recently 
two results improving on [13] were announced. In his PhD thesis Stothers [31] presents a refinement of 
the Coppersmith- Winograd algorithm running in time 0(n 2 3737 ) and Vassilevska Williams [34] developes 
a general framework for obtaining a tighter upper bound on the complexity of the Coppersmith- Winograd 
algorithm. The latter yields the best known bound of 0(n 2 ' 3727 ). Several algorithms computing exactly the 
matrix product for special classes have been designed. For example, algorithms with running time better than 
0(n 2 ' 3727 ) are known for Boolean matrix multiplication with sparse output [53] or for the case when the input 
or output matrices are sparse [2] [33] . In a recent work Iwen and Spencer 122] present a new class of matrices 
whose product can be computed in time 0(n 2+e ) by a deterministic algorithm: namely when the output 
matrix is guaranteed to contain at most n °- 29i62 non-zero entries in each column (or by symmetry row). All 
improved algorithms use as a black-box the algebraic matrix multiplication algorithm which, unfortunately, is 
only of theoretical importance. It uses sophisticated algebraic approaches resulting in large constants hidden 
in the big-Oh notation and does not admit an efficient implementation. This motivates the need of simple 
"combinatorial-like" algorithms. 
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Approximate matrix multiplication. The first approximation algorithm with rigorously understood 
complexity by Cohen and Lewis is based on sampling |12j . For input matrices with nonnegative entries they 
show a concentration around the estimate for individual entries in the product matrix with high probability. 

The amount of data to be handled has been growing at a faster rate than the available memory in modern 
computers. Algorithms for massive data sets, where only sequential access to the input is allowed, have 
become a major research topic in computer science in the last decade. Drineas et al. |17j first recognized 
the need for memory-efficient methods for matrix multiplication when access to single columns and rows of 
the input matrices is possible. They present a randomized algorithm for approximating the product of two 
matrices based on sampling. The complexity of the algorithm as well as its accuracy depend on user-defined 
parameters. The algorithm is "pass-efficient" since columns and rows of the input matrices are sequentially 
loaded into memory. The obtained approximation guarantees are expressed in terms of the Frobenius norm 
of the input matrices and the user-defined parameters but their method does not result in a strong guarantee 
for individual entries in the output matrix. Sarlos [29] observed that instead of sampling rows and columns 
one can use random projections to obtain a sketch of the matrix product. A notable difference to [17] is 
that by sketching one obtains an additive error for each individual entry depending on the 2-norm of the 
corresponding row and column vector in the input matrices. 

Recently Pagh [27 introduced a new randomized approximation algorithm. Instead of sketching the input 
matrices and then multiplying the resulting smaller matrices, we treat the product as a stream of outer 
products and sketch each outer product. Using Fast Fourier Transformation in a clever way, Pagh shows how 
to efficiently adapt the Count-Sketch algorithm [TU] to an outer product. The algorithm runs in one pass 
over the input matrices and provides approximation guarantees in terms of the Frobenius norm of their product. 

Our contribution. 

• A new algorithm for the case where the input matrices consist of nonnegative entries only. This is the 
first nontrivial deterministic approximation algorithm for the multiplication of nonnegative matrices in 
a streaming setting. Motivated by practical applications, we analyze the approximation guarantee and 
the algorithm complexity under the assumption that the entries adhere to Zipfian distribution. We 
compare it to previously known randomized algorithms and show that for certain natural settings it is 
more efficient and achieves better approximation guarantees. 

• We present a new matrix multiplication algorithm for arbitrary real-valued input matrices by adapting 
the group testing algorithm for streams with updates in the turnstile model outlined in [26; for detecting 
the entries with large absolute value in matrix product. As a byproduct we obtain the first deterministic 
algorithm running in 0(n 2+£ ) steps for matrix products with 0(y/n) nonzero entries. 

Note that our algorithms easily generalize to rectangular matrix multiplication but for the ease of 
presentation we consider the case of square input matrices. Also, we will state the time complexity of our 
first algorithm using a function Sort(n) denoting the running time of a linear space deterministic sorting 
algorithm. Clearly, Sort(n) = 0(n log n) for comparison based sorting but under some assumptions on the 
elements to be sorted also better bounds are known, e.g. the O(nloglogn) time integer sorting algorithm by 
Han [20]. 

2 Preliminaries 
2.1 Definitions 

Linear algebra. Let R+ denote the field of nonnegative real numbers. Given matrices A, B £ K + " xn we 
denote their product by C := AB. The ith row of a matrix A is written as Aj*, the jth column as A*j. We 
use the term entry to identify a position in the matrix, not its value. Thus, the weight of the entry in 
A is the value in the ith row and jth column, Ay, i,j £ [n], for [n] := {0, 1, . . . , n — 1}. When clear from 
the context however, we will omit weight. For example, nonzero entries will refer to entries whose weight is 
different from and by heavy entries we mean entries with large weight. 

The outer product of a column vector u € K+™ and a row vector v € M+™ is a matrix uv € IR + rlxn such 
that uVij = UiVj, i,j 6 [n]. The rank of a positive real number a £ M. + in a matrix A, denoted as r^(a), is 
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the number of entries strictly smaller than a, plus 1. Note that a does not need to be present in A. 

The p-norm of a vector u € R n is = QZi=i f° r P > 0. Similarly, we define the entrywise p- 

norm of a matrix A S M nx ™ as := . £ r , |As ) j| p ) 1 / 1 ' for p EN. The case p — 2 is the Frobenius norm 

of A denoted as ||A||^. The k-residual entrywise p-norm \\A\\ E k p is the entrywise p-norm of the matrix ob- 
tained from A after replacing the k entries with the largest absolute values in A, ties resolved arbitrarily, with 0. 

Data streaming. Our algorithms have strong connection to data streaming, therefore we will use the 
respective terminology. A stream S is a sequence of iV updates (i, v) for items i el and »6l. We assume 
X — [n]. The frequency of i is fa = Y^u v )gs v ano - f S = (/d ■ • • i fn-i) is the frequency vector of the stream 
S 1 . The insert-only model assumes w > for all updates and in the non-strict turnstile model there are no 
restrictions on v and the values in fg |26j . Similarly to matrix entries, we will also refer to the frequency 
of an item i as the weight of i. Items with weight above ||/||i/&, for a user-defined &, will be called b-heavy 
hitters or just heavy hitters when b is clear from the context. Ordering the items in S according to their 
absolute weight, the heaviest 6 items in S are called the top-b entries in S. 

Skewed distributions. A common formalization of the skewness in real-life datasets is the assump- 
tion of Zipfian distribution. The elements in a given set M over N elements with positive weights follow 
Zipfian distribution with parameter z > if the weight of the element of rank i is ^y|r where £(z) = Yli=i i~ z 
and \M\ denotes the total weight of elements in M. We will analyze only the case when the skew in the data 
is not light and z > 1. For z > 1, Yli=i * z converges to a small constant. We will also use the facts that for 
* > 1, EL+i *"* = Oib 1 - 2 ) and for z > 1/2, £^ 6+1 i^ = O^ 1 - 2 *). 

2.2 The column row method and memory efficient matrix multiplication 

The naive algorithm for the multiplication of input matrices A, B G M" x ™ works by computing the inner 
product of Ai^ and S*j in order to obtain ABij for all i, j G [n]. An alternative view of the approach is 
the column row method computing the sum of outer products J^j e r n i A^j-B^*. While this approach does 
not yield a better running time, it turns out to admit algorithmic modifications resulting in more efficient 
algorithms. Schnorr and Subramanian [30 and Lingas [24] build upon the approach and obtain faster 
algorithms for Boolean matrix multiplication. Assuming that A is stored in column-major order and B in 
row-major order, the approach yields a memory efficient algorithm since the matrix product AB can be 
computed in a single scan over the input matrices. Recently, Pagh [27] presented a new randomized algorithm 
combining the approach with frequent items mining algorithms [I] 110] . Inspired by this, we present another 
approach to modify the column-row method building upon ideas from deterministic frequent items mining 
algorithms [H ES 113 US . 

2.3 Skewed matrix products 

In [57] Pagh discusses several applications where one is interested only in the entries with the largest absolute 
value in the matrix product. We argue that the restriction the input matrices to be nonnegative is feasible 
for many real-life problems. Since our first algorithm is both simple and efficient, we thus believe that it will 
have practical value. 

In [12] the authors discuss applications of nonnegative matrix multiplication for pattern recognition 
problems. The presented approximation algorithm is based on sampling and is shown to efficiently detect 
pairs of highly similar vectors in large databases for the case where the majority of vector pairs have low 
similarity. We present two concrete applications of nonnegative matrix multiplication where the entries in the 
output matrix adhere to a skewed distribution. 

A major problem in data mining is the detection of pairs of associations among items in transactional 
databases, see Chapter 5 in [5T] for an overview. In this problem, we are given a database of m transactions, 
where a transaction is a subset of the ground set of items X. We are interested in items i,j£X such that an 
occurrence of i in a given transaction implies with high probability an occurrence of j in the transaction, 
and vice versa. The lift similarity measure [7] formalizes the above intuition. Let Si and Sj be the set of 
transactions containing the items i and j, respectively. Then the lift similarity between i and j is defined as 
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jg Hg'j , The database is usually huge and does not fit in memory, therefore one is interested in solutions 
performing only a few scans of the database. There are randomized algorithms for the problem [9J [TT] but to 
the best of our knowledge there is no known deterministic algorithm improving upon the straightforward 
solution. 

The problem can be reduced to matrix multiplication as follows. In a first pass over the database we 
compute the exact number of occurrences fi for each item i. Then we create an n x m matrix A such that 
Aij = 1/ fi if and only if the item i occurs in the jth transaction. Now it is easy to see that the weight of the 
entry in the product AA T is equal to lift similarity between i and j. The nonzero entries in the jth 
column in A correspond to the items in the jth transaction, thus the column row approach is equivalent to 
computing the stream of Cartesian products for all transactions such that each occurrence of the the items 
pair (i,£) has weight l/{f l fi) 

Figure [T] presents the distribution of lift similarities among pairs for two real datasets: pumsb, containing 
census data for population and housing, and accidents, built from data about traffic accidents in Belgium. Both 
datasets are publicly available at the Frequent Itemset Mining Dataset Repository (| http://fimi.ua.ac.be/data7| - 
Note that the similarities are plotted on a doubly logarithmic scale and except for a few highly correlated 
pairs, the similarities among item pairs decrease almost linearly. Therefore, the assumption for Zipfian 
distribution is realistic. 

As a second example consider the popular Markov Clustering algorithm (MCL) for the discovery of 
communities within networks |16j . In a nutshell, for a given graph the algorithm first creates a column 
stochastic matrix M based on the graph structure. The algorithm then works iteratively. In each iteration 
we update M to the product MM and then set small entries to 0. A significant proportion of the entries 
in each column is small. M converges fast to a sparse matrix indicating clusters around certain vertices in 
the original graph. The most time consuming steps are the first few iterations, thus an algorithm for the 
detection of the significant entries in matrix products can improve the time complexity. In an additional pass 
one can compute the exact weight of the significant entries. 
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Figure 1: The similarity distribution among pairs for two datasets on a log-log scale. 



3 An algorithm for nonnegative matrix products 
3.1 Intuition and key lemma. 

Recall first how the Majority algorithm [6 works. We are given a multiset M of cardinality N and want to 
find a majority element, i.e. an element occurring at least N/2 + 1 times in M. While there are two distinct 
objects in M we remove them from M. It is easy to see that if there exists a majority element a, at the end 
only occurrences of a will be in M. 

The Frequent algorithm p~5|. l23l [25] builds upon this simple idea and detects &-heavy hitters in an 
unweighted stream S of N updates (i, 1) for items % € [n]. We keep a summary of b distinct entries together 
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with a counter lower bounding their weight. Whenever a new item i arrives we check whether it is already 
in the summary and, if so, update the corresponding counter. Otherwise, if there is an empty slot in the 
summary we insert i with a counter set to 1. In the case all b slots are occupied we decrease the weight of 
all items by 1 and proceed with the next item in the stream. The last step corresponds to removing 6+1 
distinct items from the multiset of items occurring in S and a simple argument shows that 6-heavy hitters 
will be in the summary after processing the stream. By returning the estimated weight of the item in the 
summary and for not recorded items, the weight of each item is underestimated by at most ||f||i/6 where f 
is the frequency vector of the stream. Implementing the summary as a hash table and charging the cost of 
each item deletion to the cost incurred at its arrival the expected amortized cost per item update is constant. 
A sophisticated approach for decreasing the items weights in the summary leads to a worst case constant 
time per item update [T5J [23] . 

Generalizing to nonnegative matrix multiplication by the column row method is intuitive. Assume the 
input matrices consist of {0,1}- valued entries only. We successively generate the n outer products and run the 
Frequent algorithm on the resulting stream associating entries with items. There are several problems to 
resolve: First, we want to multiply arbitrary nonnegative matrices, thus our algorithm has to handle weighted 
updates. Second, we have to consider 0(n 3 ) occurrences of weighted items in the stream. Third, we cannot 
apply any more the amortized argument for the running time analysis since a group of b — 1 heavy items 
might be followed by many lighter items causing expensive updates of the summary and it is not obvious how 
to extend the deterministic approach from [15] [23] guaranteeing constant time updates in the worst case. 

The first issue is easily resolved by the following 

Lemma 1 Let f be the frequency vector of an insert only stream S over a domain [n] . After successively 
decrementing t times the weight of at least b distinct items by Aj > 0, 1 < i < t, such that at each step fa > 
for all < i < n — X, it holds fa > for all b-heavy hitters, k £ [n], for all t G N. 

Proof: Since fa > for all i € [n] holds, the total decrease is bounded by ||f ||i. A decrement of A, in the 
weight of a given item is witnessed by the same decrement in the weights of at least 6—1 different items. 
Thus, we have &X}j=i Aj < ||f ||i which bounds the possible decrease in the weight of a heavy hitter to ||f \\i/b. 
□ 

In the next section we show that the specific structure of an outer product allows us to design efficient 
algorithms resolving the last two issues. 

3.2 The algorithm. 

We assume that A £ M™ xn is stored in column- major order and B £ M" xn in row-major order. We show 
how to modify the column row method in order to obtain an additive approximation of each entry in AB in 
terms of the entry wise 1-norm of AB. 

Essentially, we run the Frequent algorithm for the stream of n outer products: we keep a summary S of 
b distinct items and for each outer product we want to update the summary with the incoming weighted 
entries over the domain [n] x [n). The main difference is that for b = o(n 2 ) we can use the specific structure 
of an outer product and update the summary in o(n 2 ) steps. In ComputeSummary in Figure [2] for each of 
the n outer products we simulate the successive application of Lemma [T] until at most b entries with weight 
larger than remain in the outer product. We then update S with the remaining entries. 

Correctness. 

Lemma 2 Let w be the weight of an entry in the product C — AB. After termination of ComputeSum- 
mary for the estimated weight w of w returned by EstimateEntry, i,j £ [n], holds max(w — \\C\\E 1 /b, 0) < 
w < w. 

Proof: The product AB equals J27=o ai ' ^* f° r * ne columns do, . . . , a„_i of A and the rows bo, ... , 6„-i of 
B. We consider each outer product as n 2 updates for different entries over the domain [n] x [n] in an insert 
only stream with positive real weights. We show how the algorithm updates the summary for a single outer 
product R. First, in line 3 the algorithm finds the entry of rank b + 1 in R. In line 4 we decrease the weight 
of the b largest entries by w^ +1 which yields the same result as the following iterative procedure: While there 
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function ComputeSummary 

Input: matrices A,B £ R + nxn , summary S for 6 entries 
1: for i £ [n] do 

2: Denote by R := A+j ■ 5^* the outer product of the ith column of A and zth row of B 
3: Find the weight w^ +1 of the entry of rank b + 1 in R 

4: Let C be the b entries in R with rank less than b + 1, i.e. the largest b entries 
5: Decrease the weight of each entry in C by w^ +1 
6: for each entry e entry occurring in S and C do 
7: add e's weight in C to e's weight in S 

8: remove e from C 

9: Find the weight wf¥f of the entry of rank 6+1 in <S U C, if any 

10: Update S to contain the largest b entries in S U C and decrease their weight by wfVf 



function EstimateEntry 

Input: Entry 
1: if (i, j) is in the summary S then 
2: return the weight of (i,j) in S 
3: else 

4: return 



Figure 2: A high-level pseudocode description of the algorithm. In ComputeSummary we iterate over the n 
outer products and to each one of them apply Lemma [I] such that only the b heaviest entries remain. We 
update the summary with the entries output by the outer product. After processing the input matrices we 
can estimate the weight of an individual entry by checking the summary. 



are at least 6 + 1 nonzero entries in R, find the entry with smallest weight w m i n in R and decrease the weight 
of all non-zero entries by w m i n . Equivalence holds because we always decrease the weight of an entry with 
the smallest weight and thus the decrease of the largest 6 entries weights can never exceed i«S r Also, the 
decrease can not be smaller than wf^ +1 since otherwise we would have more than 6 non-zero entries in the 
outer product. Thus, we always decrease by the same amount the weight of at least 6+1 different entries 
which by Lemma [l] guarantees the claimed approximation error. In lines 6-10 we apply essentially the same 
procedure again for the nonzero entries in the outer product and the entries in the summary. The remaining 
at most 6 nonzero entries constitute the updated summary. □ 



Running time. In the following lemmas we present efficient deterministic algorithms for the subroutines 
used in ComputeSummary. We concentrate how the algorithm updates the summary for a single outer 
product. Before presenting our approach, we give the main building blocks that will be used to achieve an 
efficient solution. 

Lemma 3 Given two sorted vectors u,v £ R + n we can find the entry of rank b in the outer product uv in 
time and space 0(n). 

Proof: We reduce the problem to selection of the element of rank 6 in a Cartesian sum X + Y = {x + y:x£ 
X,y £Y} for sorted sets of real numbers X and Y. Setting U — {\ogUi : m £ u} and V — {log : Vi £ v} 
and searching in the Cartesian sum U + V for the element of rank 6 corresponds to searching for the entry 
of rank 6 in the outer product uv, this follows from monotonicity of the log : K+ — > K function. The best 
known deterministic algorithm for selection in a Cartesian sum of two sorted sets |18j runs in time and space 
0(n). □ 

Lemma 4 Given vectors u, v £ with elements sorted in descending order, we can output an implicit 

representation of the largest 6 elements from the outer product uv in a data structure C in time and space 
0(n). 

Proof: Assume we have found the entry of rank 6 as outlined in Lemma [3j let this element be c. Let i, j be 
two pointers for u and v respectively. Initialize i = 0,j = n — 1. Assume i is fixed. We compare c to UiVj. 
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While it is larger or equal, we move left u's pointer by decreasing j by 1. At the end we add the pair (i,j) 
to £, denoting that the entries in ith row of uv bigger than c, and thus of rank less than b, are all (i, i) for 
i < j. Then we go to the next row in uv by incrementing i and repeat the above while-loop starting with the 
current value of j. When i — n or j — we know that all entries smaller than c have been found. Correctness 
is immediate since the product UiVj is monotonically increasing with i and decreasing with j, and thus for 
each row of the outer product we record the position of the entries smaller than c in C Both i and j are 
always incremented or respectively decremented, thus the running time is linear in n. We need to explicitly 
store only it, v and C, this gives the claimed space usage. □ 
Next we present an efficient approach for updating the summary for a given outer product after finding 
the entries of rank at most b. 

Lemma 5 For a given outer product uv, «,ti£ R + " we update the summary S in time 0(b + sort(n)). 
Proof: 

We first sort the vectors u and v in decreasing order according to the values Ui and Vi, respectively. Let us 
call the sorted vectors u 8 and v s . Each entry in u s and v s will be of the form (val,pos) such that u pos = vol 
and v pos — vol, respectively, i.e. pos will record the position of a given value in u and v. We define the entry 
(«, j) in the outer product u s v s as a (val u val v , (pos u ,pos v )) such that uf = (val u ,pos u ) and u S j = (val v ,pos v ). 
Comparing the entries on the val u val v values, we can assume that we compute the outer product of two 
sorted vectors. 

Assume we have computed the data structure C implicitly representing the largest b entries in u s v s , as 
shown in Lemma [4] Now we show how to update the summary with the entries in C in time 0(b + sort(n)). 
We introduce a position total order on entries such that (ix,ji) < (*2, J2) iff hn + j\ < i2n + J2, h j £ [n]. 
We will keep the entries in S in the summary sorted according to this order. Assume we can output the 
b heaviest entries from a given outer product sorted according to the position total order in C. Then in a 
merge-like scan through S and C we update the entries in S n C, remove those from C and obtain a sorted 
data structure containing the entries from S and C in 0(b) steps. The entry of rank b + 1 in the set C U S, 
which has size at most 26, can be found in 0(b) by |4. Thus, if the entries in C are sorted according to the 
position total order, updating the summary will run in 0(b) steps. 

We output the b heaviest entries sorted according to the position total order by the following algorithm. 
Let C be implicitly given as a sorted column vector u s and a sorted row vector v s as described above, and 
i < n integer pairs (g, r q ) denoting that in the qih row in the outer product u s v s the first r q > entries have 
rank not more than b. Clearly, r q will monotonically decrease as q increases. We start with q = £, namely the 
shortest interval, sort the r q entries according to the position total order. We then decrease q by 1 and sort 
the next r q _i entries according to the position total order. However, we observe that due to monotonicity 
r 9 _i > r q and all elements from v s appearing in the qth row of u s v s also appear in the (q — l)th row. Thus, 
we can sort only the new r 9 _i — r q elements and then merge the result with the already sorted r q elements. 
We continue like this until the elements in each row of the outer product have been sorted. Then we sort 
the elements in the column vector u s according to their position, keeping a pointer to the corresponding 
sorted subinterval of v s for each entry in u s . From this we build the set £ with entries sorted according 
the position total order. By setting r^ + \ = the running time for the I mergings and sortings amounts to 
Di=o( ri + Sort(ri - r i+ i)). We can bound this sum by 0(b + Sort(n)) since Y1=q r i = J2i=o( r i ~ r m) < n 
and X)™=o 1 f( x i) — f(Y^i=o Xi ) ^ or a monotonically growing superlinear function and numbers Xi, i G [n], in 
its domain. □ 

3.3 Analysis of the approximation guarantee. 

The only remaining component is how to efficiently answer queries to the summary after processing all outer 
products. We use a static dictionary with constant look-up time. Observing that the entries are from a 
universe of size n 2 , the best known result by Ruzic [28] provides a construction in time (9(61og 2 log&) and 
space 0(b). Note that b < n 2 , therefore as a first result we observe that Lemmas [2] and [5] immediately yield 
the following 

Lemma 6 Given n x n-matrices A, B with non-negative real entries, there exists a deterministic algorithm 
approximating the weight of each entry in the product C of A and B within an additive error of \\C\\E 1 /b. 
The algorithm runs in time 0{nb + nSort(n)) and space 0(b + n) in one pass over the input matrices. 
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It was first observed by Bose et al. [5] that the Frequent algorithm guarantees tighter estimates for 
items with weight significantly larger than N/b in a stream of length TV and summary of size b. Berinde et 
al. [3] develop a general framework for the analysis of so called heavy-tolerant counter based algorithms and 
show that Frequent falls in this class. In order to keep the paper self-contained instead of referring to 
the results in [3] we show why our algorithm provides much better guarantees for skewed distributions than 
suggested by Lemma [6] Also, the analysis we present is simpler since it does not apply to a wide class of 
counter based algorithms. 

Let us first give some intuition why better approximation guarantees are possible. In order to bound the 
possible underestimation of an entry in the summary we assume that each decrement of a given entry is 
applied to at least b distinct entries. Since the total weight of all entries is HCH^ we obtain that the total 
weight charged to a given entry is bounded by ||C|| j e 1 /6. If we had only 0(b) different entries all having 
weight about HCH^/6, then the approximation error given by Lemma [6] is tight. However, assume some 
entries have weight oHCH^ for a > 1/6. This means that the total weight we might have charged when 
decrementing the weights of groups of at least b distinct entries can be bounded by ||C||si — [ct — 1/&)||C7||.e 1 . 
Iteratively applying the argument to all heavy entries we arrive at improved approximation guarantees. The 
results in [31 [S] are based upon the above observation. 

Lemma 7 (Bose et al, For an entry (i,j) in C = AB with weight ct\\C\\Ei, otb > 1, after termination 
of ComputeSummary it holds Cij > — (1 — ol)\\C\\ei/ {b — 1) where Cij is the approximation of Cjj 
returned by EstimateEntry(i, j) . 

Proof: By Lemma 1 we have that the underestimation is at most jj El . As outlined above this implies that 
we have a total weight of at most || C|| Si (ct — \) to charge from, which implies a new upper bound on the 
underestimation of HCH^ — HCH^a — = ^ C ^ E ^ 1 + " C ^ El , Successively applying the above reasoning 

k times we can bound the underestimation to Yli=i ^ "^P^ 1 + ■ Since b > 1 for k — > oo the claim 

follows. □ 

Lemma 8 (Berinde et al, ||C||^;fc 1 /(6 — k) is an upper bound on the underestimation of any Cij returned 
by EstimateEntry(j, j) for any k <b. 

Proof: Assume that S is an upper bound on the underestimation returned by EstimateEntry. We apply 
again the above reasoning and since k < b we see that the underestimation is bounded by ^tli^ilMi We can 

continue iterating in this way setting Si = fc<5 '~ 1+ ^ c '^ Efcl while Si < for the underestimation in the ith 
step holds. We either reach a state where no progress is made or, since the underestimation is lower bounded 
by HCII^fc!, we have Si — > S as i — > oo. In either case the claim follows. □ 

The above lemmas are important since they yield approximation guarantees depending on the residual 
fc-norm of the matrix product, thus for skewed matrix products the approximation is much better than the 
one provided by Lemma [6] 

Sparse recovery. The approximation of the matrix product C = AB in [TTI 1271 125] is analyzed in 
terms of the Frobenius norm of the difference of C and the obtained approximation C, i.e ||C — C\\p. By 
simply creating a sparse matrix with all non-zero estimations in the summary we obtain an approximation of 
C: the so called fc-sparse recovery of a frequency vector f aims at finding a vector f with at most k non-zero 
entries such that the p-norm ||f — f || p is minimized. 

As shown by Berinde et al. [3] the class of heavy-tolerant counter algorithms yields the best known bounds 
for the sparse recovery in the p-norm. The following Theorem 1 follows from Lemma [5] and their main result. 

Theorem 1 Let A, B be nonnegative n x n real matrices and C = AB their product. There exists a one-pass 
approximation deterministic algorithm returning a matrix C such that \\C — C\\e p < (1 + {s/k) 1 ~p || C|| £:*= l - 
The algorithm runs in time 0(n ■ Sort(n) + (nk)/e) and uses space 0(n + k/e) for any < e < 1 and k > 1. 

Proof: Let us write the entries of C in ascending order according to their weight as Cq,C\, . . . , C„2_ 1 , 
thus ||C|| B fc p = (X^Lfc 1 Observe that by setting b = k + | the bounds given by Lemma [s] yield 
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an underestimation for every entry of at most j!^ 1 . Thus, we can upper bound (||C — C\\e p ) p as 
k( ell °l Ekl )P + J^lll 1 \Ot - Q|(|)^ 1 (||C|| En )P- 1 . The last term is at most (|)^ 1 (l|C|Un) p , thus we 
can upper bound (||C — C|| p ) p to (1 + £)!(||C||£'=i) p - The time and space complexity follow directly from 
Lemma [5] □ 

Clearly, for k/e = o(n 2 ) the algorithm runs in subcubic time and subquadratic memory. In the next 
paragraph we show that for skewed output matrices EstimateEntry can provably detect the most significant 
entries even for modest summary sizes. 



Zipfian distributions. As discussed in Section [273] the assumption that the entries in the product adhere 
to a Zipfian distribution is justified. The results stated below not only give a better understanding of the 
approximation yielded by the algorithm, but also allow direct comparison to related work. 

Lemma 9 (Berinde et al, Jjfl) If the entries weights in the product matrix follow a Zipfian distribution with 
parameter z > 1, then EstimateEntry with a summary of size b 



1. approximates the weight of all entries with rank i < b with additive error of (1 — ((z)i z ) fc-i 1 • 

2. estimates the weight of all entries with additive error o/e||C||.Ei for b = 

3. returns the largest k entries in the matrix product for b — 0(k). 
4- returns the largest k entries in a correct order for b = Q(k(^)i). 

Proof: The first statement is a direct application of Lemma Pf] For 2. we choose k — 6/2 in Lemma [8| The 

error is then bounded by " "f^ 1 . By plug ging in the bounds on the tail of the Zipfian distribution we 

ii oil 

obtain that the error is bounded by 0( 11 fc 'l E1 ). This implies that there exists b — c^k such that all entries 
of weig ht ll£|fS for some constant q,, will be in the summary. This shows 3. Since the entry of rank k 
has weight jl^j^ and we assume that £(z) is constant. For the last statement we observe that the function 
fix) = — j^xy is monotonically decreasing for x > and z > 1. Thus we need an error of at most 

25 < ||C||bi(^ — (fc+i)* )• With some algebra we can bound the difference to Q( ^^|fe ), thus by 2. the 
claim follows. □ 

3.4 Comparison to previous work. 

The randomized algorithm by Cohen and Lewis [12J for computing the product of nonnegative matrices 
yields an unbiased estimator of each entry and a concentration around the expected entry weight with 
high probability. However, their algorithm requires a random walk in a bipartite graph of size Q(n 2 ) space 
and is thus not space efficient. It is difficult to compare the bounds returned by EstimateEntry to the 
bounds obtained in [171 129] , but it is natural to compare the guarantee of our estimates to the ones shown by 
Pagh |2Z| . 

The approximation error of the matrix estimation C in [57j, \\C — C\\f, is bounded by (n||C||i?)/ Vb 
with high probability. The running time is 0(n 2 logn + b log b log n) and space usage is 0(n + blogn). 
Our deterministic algorithm achieves an error guarantee of (1 + e)(s/k) 1 ~p\\C\\ E k 1 for the approximation 
\\C — C\\e p for any p > 0. For a direct comparison we set p — 2, k = and b = fl/e] and obtain an 
approximation error of ||C||bi/Vo which is at most (n||C||i?)/-\/6 by Cauchy-Schwarz inequality. The time 
and space complexity of our algorithm is a polylogarithmic factor better. Note also that the approximation 
guarantee does not depend on the dimension n as in [27] , 

For individual entries we achieve an error bounded by min feg [ b ] ||C|| £ ;fc 1 /(6 — k) while [2 7) shows that the 
error of the obtained estimates is bounded by ||C'|| £ ;6/ f s 1 /v / 5 for a suitably chosen constant k > 2. 

Assuming Zipfian distribution with z > 1 the approximation error of the Frobenius norm of the matrix 
product in |27j is bounded by 0(nb~ z \\C\\Ei) with high probability. By setting k — our deterministic 
algorithm achieves 0(||C||.ei/\/&~) for the Frobenius norm approximation error. For an £||C||£i-approximation 
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Figure 3: We compare the theoretical error guarantee achievable by Pagh's j27j and our algorithm for the 
estimation of individual entries for sketch of size b for the dataset accidents. 



of individual entries both |27] and our algorithm need a data structure, a sketch or a summary, of size 
0((l/e)i) but [21] needs to run O(logn) copies of the algorithm in parallel in order to guarantee that 
the estimates are correct with high probability. Figure [3] plots the additive approximation achieved for 
different summary sizes for the accidents dataset. The dashed line gives the theoretical guarantees on the 
approximation error shown in |27j . depending on the entrywise 2-norm of the matrix product, and the solid 
line the approximation we achieve (depending on the entrywise 1-norm). In order to achieve guarantees with 
high probability Pagh's algorithm needs to run O(logn) copies of the algorithm in parallel, thus increasing 
the time and space complexity by a factor of O(logn). However, Pagh's algorithm achieves better bounds for 
lighter skew when 1/2 < z < 1 and more important it is not restricted to nonnegative input matrices. 



4 An algorithm for arbitrary real-valued matrices 

In this section we show how to efficiently extend the deterministic streaming algorithm sketched in [141 126) to 
matrix multiplication. The algorithm in [141 126] works for streams in the non-strict turnstile model where 
updates are of the form (i, v) for an item i and »€K. 

Let us first consider the generalized Majority algorithm for the non-strict turnstile model. We want to 
find an item whose absolute total weight is more than half of the absolute sum of total weights of all other 
items in the stream. Recall that in the non-strict turnstile model an item is allowed to have negative weight 
after processing the stream. We adapt the key observation in [H] to obtain 

Lemma 10 A majority item in a real weighted stream over a domain of size m in the non-strict turnstile 
model can be determined in two passes using [log m] + 1 counters and processing time per item arrival. 

Proof: Assume the items are integers over some finite domain [to]. We form [logm] groups <?fc, < fc < 
[logm] — 1, one for each bit of the binary representation for each item i € [to]. For each arriving item i we 
compute its fcth bit and if it is 1 we add i's weight to the counter Ck of the fcth group. We also maintain 
the total weight of the stream seen so far in a global variable w. After the stream has been processed we 
first determine the sign of w. Each item either increases the weight of the fcth group or does not change 
it. Since the set of items is divided in two disjoint subsets for the fcth group depending on the fcth bit of 
each item, given w and Cfe we can also determine the total weight of items whose fcth bit is equal to 0, call 
it Wq. First observe that an item with absolute weight more than |iu|/2 must have the same sign as w. 
Thus, if the weights in the two groups for the fcth bit have different signs we determine the fcth bit of the 
potential majority item. Otherwise, if the 0-group and the 1-group have weight with equal sign, the majority 
element must be in the subset with larger absolute weight. We show this by contradiction. Set w\ := cj.. 
Assume w.l.o.g. that both groups have positive weights Wq > 0, > 0, Wq > and the fcth bit of the 
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majority item is 1. Let the weight of the majority item m be w m . We have w k > w m — w_ 1 where u>^ 1 
is the absolute value of the total contribution of entries with negative total weight and kth bit 1. The last 
inequality follows from the fact that w m is positive and the sum of weights of items with positive total weight 
is lower bounded by w m . But this implies w m < w k + w k _ 1 < Wq + w^L 1 which contradicts that m is the 
majority item. Therefore it must hold Wq < w\. 

After the first pass we can construct a candidate majority item from the [log m~\ bits found. A second 
pass over the stream will reveal whether indeed the candidate is a majority item. □ 

We generalize the above algorithm to finding a majority entry in a matrix product. Generalizing again builds 
upon the column row method. 

Lemma 11 A Majority entry in a matrix product can be computed in time 0(n 2 logn), linear space and two 
passes over the input. 

Proof: Let us assume n = 2 i for some integer £ > 0. We number the n 2 entries of the matrix product as 
0, 1, . . . , n 2 — 1 such that the entry in the position (i, j) is assigned a number i2 l + j, < i, j < n — 1. Now 
observe that each entry of the matrix product consists of 2£ bits and the term j determines only the least 
significant £ bits while the term i2 l determines the most significant i bits. In the sequence 0, 1, . . . , n 2 — 1 the 
elements having 1 in the fcth position, < k < 21— 1, are the ones in positions {2 k +i2 k+1 , . . . , (i + l)2 k+1 — 1}, 
< i < 2 2e ~ k ~ 1 . Thus, given a column vector a of A and a row vector b of B the entries in the outer product 
ab with the fcth bit equal to 1 are uniquely determined by the position of the contribution from either a or 
b. More concretely, for k < £ and k > £ these are the entries in positions {2 k + i2 k+1 , . . . , (i + l)2 k+1 — 1}, 
< i < 2 e ~ k ~ 1 , in a and b, respectively. 

Thus, to find the total contribution of entries weights with a bit set to 1 in a given position we need to 
simply consider only the entries in the corresponding interval and by the distributive law for the ring of real 
numbers we can compute the total contribution to the group for the ith 1-bit in 0(n) steps. Writing the 
matrix product as a sum of n outer products and applying to the resulting stream the Majority algorithm 
from Lemma [lO] with the above outlined modification yields the claimed running time. □ 

Theorem 14 in }26j presents a generalization of the Majority algorithm for non-strict turnstile data streams 
where at most k items have a value significantly different from after processing the stream. We combine the 
approach with the technique presented by Pagh [27 to obtain our main theorem for the multiplication of 
arbitrary real valued matrices. The proof of the theorem analyzes the complexity of the algorithm in Figure [3] 

Theorem 2 Let A, B be real n x n matrices and C = AB their product. If the absolute weight of each of 
the b entries with largest absolute weight is bigger than \\C\\Ebi, then there exists a deterministic algorithm 
computing the b heaviest entries exactly in time 0(n 2 + nb 2 log 3 nlog 2 b) and space 0(n + b 2 log 3 nlogfr) in 
two passes over the input. 

Proof: We first describe the approach presented in [26] for data streams over the non-strict turnstile model 



and then adapt it to matrix multiplication using Lemma 11 and the approach from [27] ■ 

Let P = {pi,P2, • • • ,Px} be a set of consecutive prime numbers such that b < p\ < P2 < ■ ■ ■ < p x and 
x = (b — 1) log b N + 1 where N is the cardinality of the set of items in the input stream. Assume that at 
most b items will have weights different from 0. For each prime number p we create p groups labelled as 
0, 1, . . . ,p — 1. Now, for a given p, each incoming item i is distributed to exactly one of the p groups by 
computing i mod p. The crucial observation is that for any subset of b non-zero items each one of them will 
be isolated in at least one group. This is true because two different items can share at most log h ./V groups for 
different prime numbers. Otherwise the absolute value of their difference modulo p is for log b N + 1 distinct 
primes larger than b which contradicts the fact that all items are smaller than N. Thus, for a given non-zero 
item i each of the other b — 1 non-zero items "locks" at most logf, N groups for different primes. Since we 
have {b — 1) log b N + 1 distinct primes there must exist a prime p such that i mod p = r( mod p) implies 



j mod p ^ r( mod p) for all j e B,j ^ i. Thus running the Majority algorithm from Lemma 10 for each 
group and checking the results in a second pass will reveal the b non-zero items. Clearly, the algorithm is 
resilient to noise in the zero- valued items. 

We now adjust the approach to matrix multiplication with sparse output. Figure [4] presents a pseudocode 
description of the algorithm. Given n x n real input matrices A and B and a parameter b we assume we 
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function ComputeGroups 

Input: matrices A,B £ M. nxn , an array of primes P, integer b 
1: Create data structures Q of size XxY and G of size X x Yx L for X = c^felog n,Y = c y blogn log b,L — 2£ 
for suitably chosen constants c x and c y and £ = [log n~\ 
for i £ [n] do 

a = b = Bi * 
for j := to \P\ do 
prime p := P[j] 



2 
3 
1 
5 
6 
7 
8 
9 
10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
20 
21 
22 



Pa = Et=0 a tX t n m ° d P 
= Et=0 P 

p ah = FFT{p ai p b ) 

Pab = pab mod x p + pab div x p 

for m € [p] do 

<2[j][m] := Q[j][m] + c m for c m x m £ p^ 
for k £ [2£] do 

if k < I then 

set to the entries a t for t £ {i2 k+1 , ...,(« + 1)2' £+1 - 2 fc - 1}, < i < 2 2i - k - 1 

else 

set to the entries b t for t £ {i2 k+1 , . . . , (i + l)2 k+1 - 2 k - 1}, < i < 2 2£ - k - 1 

Pa = Et=o a t^' n m ° d P 

Pb = Et=o b tx l mod p 

p a b = FFT(p ai p b ) 

Pal, = pab mod x p + Pab div x p 

for m £ [p] do 

G[j][m}[k] = G[j][m}[k] + c m for c m x m £ p^ 



function CheckCandidates 

Input: Data structures G and Q, array of prime numbers P 
1: Initialize a X x Y matrix K with 2^-bitstrings for X = c x blogb, Y = c y b\ogn\ogb 
2: for j := to \P\ do 
3: prime p := P[j] 
4: for m £ [p] do 
5: for k £ [21] do 

6: Set fcth bit of if[j][m] depending on the values Q[j][m] and G[j][m][fc] 

7: for k £ K do 

8: set i from the first I bits in k 
9: set j from the last I bits in k 
10: Check the value of AB(i,j) 

Figure 4: The function ComputeGroups distributes the entries to different groups depending on their 



mod value for a prime number p. In each group we update 21 + 1 subgroups as outlined in Lemma 11 In 
CheckCandidates we construct a candidate entry for each group and find its exact weight in the matrix 
product. 



have found the (6—1) log b n 2 + 1 consecutive prime numbers larger than 6, denote them as V. As in the 
algorithm for nonnegative input matrices we iterate over n outer products and consider each of them as 
n 2 distinct item arrivals. We iterate over primes p £ V '. We concentrate how ComputeGroups works for 
a fixed prime p £ V and an outer product ab for a = A* i,b = -B^*, i £ [n]. We want to distribute each 
of the (ijj) entries in ab in p groups depending on i,j and p and then run the Majority algorithm from 
Lemma 10 in each group. We write each of the n 2 entries as i ■ n + j, for i, j £ [n]. Since for any two 
Tlx, 1^2 £ N we have {n\ +ri2) mod p = (rii mod p + n-2 mod p) mod p we want to compute Ej j a ibj such that 
(i ■ n mod p + j mod p) mod p = I for each £ £ [p] . The naive solution would explicitly compute the mod p 
value of each of the n 2 entries and update the corresponding group. As observed by Pagh |27j . however, one 
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can treat the two vectors as polynomials of degree p — 1: p a — J2i=o a % x% n mod p an d Pb = Sj=o fyx* mod p , 
lines 6-7. In line 8 we multiply the two polynomials in time 0(p\ogp) using FFT Q The product is 
PaPb — Sfc=o ^ Ckxk wnere c k — J2i j a ibj such that (i ■ n mod p) + (j mod p) = k, thus we only have to add 
up the entries with the same exponents modulo p: (p a Pb mod x p + p a Pb div x p ). In line 11 we update the 
global counter from Lemma [lO] recording the total contribution of entries weights in the stream and store it 
in a matrix Q in a corresponding cell. Next, in lines 12-22, we repeat essentially the same procedure for 
each of 21 bits of each entry: we nullify the respective entries in either a or b, those with a in the kth 



bit, as outlined in Lemma 11 Then we multiply the updated polynomials and add to the counter of the 
corresponding (sub)groups the newly computed coefficients, stored in a data structure G. 

After the first pass over the input matrices we have to construct a majority candidate for each of the p 
subgroups of a prime p € V and check whether it is indeed different from in a second pass over the input 
matrices. In CheckCandidates we do this by using the previously computed data structures G and Q from 
which we extract the required information as outlined in Lemma [XT] 

Correctness of the algorithm follows from the above discussion and the previous lemmas. 

For the complexity analysis we need to know the cardinality of V and the order of the largest prime in V . 
In the following we assume b > 2. We have {Vl = 0(blog b n) = 0(6 log n) and from the prime number theorem 
we observe that for the largest p € V we have p = 0(61og fc n(log6 + loglog b n)) = 0{b\ogb\ogn). Therefore, 
for a given outer product we iterate over O(felogn) primes and in each iteration O(logn) times we multiply 
using FFT two polynomials of degree 0(61og61ogn). This yields a total running time of 0(b 2 log 3 nlog 2 b). 
For the data structures G and Q we need a total space of 0(b 2 log 3 nlog b). CheckCandidates needs time 
of 0(nb 2 log 3 nlog 6) to find the b nonzero entries in the product. □ 



The above theorem immediately implies the following result for sparse matrix products: 

Corollary 1 Let A, B be real n x n matrices and C = AB their product. If C has at most b nonzero entries 
then there exists a deterministic algorithm computing C exactly in time 0(n 2 + nb 2 log 3 n log 2 b) and space 
0(n + b 2 log 3 nlog b) in two passes over the input. 



4.1 Zipfian distribution. 

For the case when the absolute values of the entries in the outer product adhere to Zipfian distribution with 
parameter z > 1 we obtain the following 

Theorem 3 Let the absolute values of the entries weights in a matrix product adhere to Zipfian distribution. 
Then for user- defined s > and k > there exists a deterministic algorithm detecting the ks heaviest entries 
in the product in time 0(s(n 2 + nk^ 1 log 3 nlog 2 k)) and space 0(n + ks + fc 7 ^ 1 log nlogfc) in 2s passes 
over the input matrices. 

Proof: We run ComputeGroups with suitably chosen parameters such that an entry with absolute 
weight of at least / := ^jygi lands in at least one group such that the contribution from other entries is 

2 

less than /. Thus, we have to isolate the x heaviest entries in different groups such that ^ > Yli= x +i h- 

Since Y]^— x . i jk < cx x ~ z for some constant c, we need ^ > cx x ~ z x > ck^ 1 . In a second pass we run 
CheckCandidates and find the k heaviest entries. 

A natural generalization for detecting the heaviest entries in s passes for a user-defined s works as follows. 
In one pass we will run ComputeGroups and in the subsequent pass CheckCandidates. Assume after 
the 2ith pass, 1 < i < s — 1, we have found the ik entries with largest absolute weight and they are stored in 
a set S. Then, in the (2i + l)th pass we run ComputeGroups but at the end we subtract the weights of all 
entries in S from the counters in the corresponding groups in order to guarantee that the ik heaviest entries 
are not considered. After finding the new k heaviest entries we add them to S. □ 

^^Note, that the standard FFT algorithm assumes the highest order is a power of 2, thus we have to add < p high order terms 
to p a and p;, with coefficients equal to to obtain the product p a Pb- 
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4.2 Comparison to previous work. 

The algorithm seems to be only of theoretical interest. Note that its complexity is much worse than the 
one achieved by Pagh's randomized one-pass algorithm [27] : the b non-zero entries can be found in time 
0(n 2 + nblogn) and space 0(n + blogn) with error probability 0(l/poly(n)). Nevertheless since the best 
known space lower bound for finding b non-zero elements by a deterministic algorithm is 0(blogn) there 
seems to be potential for improvement. For example Ganguly and Majumder |19j present improvements of 
the deterministic algorithm from |26| but their techniques are not suitable for our problem. 

To the best of our knowledge this is the first deterministic algorithm for computing matrix products in 
time 0{n 2+E ) for the case when the product contains at most 0(y/n) non-zero entries. The algorithm by Iwen 
and Spencer achieves this for an arguably more interesting class of matrix products, namely those with n° , 
(3 < 0.29462, nonzero entries in each row, but the algorithm relies on fast rectangular matrix multiplication 
and its simple version runs in time 0(n 2+l3 ). 

Acknowledgements. I would like to thank my supervisor Rasmus Pagh for his continuous support 
and many valuable comments and suggestions. 
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